Asymptotic Equipartition Property

We have been looking at the problem of data compression: algorithms for it as well as fundamental limits on the compression rate. In this chapter we will approach the problem of data compression in a very different way. The basic idea of the asymptotic equipartition property is to consider the space of all possible sequences produced by a stochastic (random) source, and to focus attention on the "typical" ones. The theory is asymptotic in the sense that many of the results focus on the regime of large source sequences. This technique allows us to recover many of the results for compression, but in a non-constructive manner. Beyond compression, the same concept has allowed researchers to obtain powerful achievability and impossibility results even before a practical algorithm was developed. Our presentation here will focus on the main concepts and present the mathematical intuition behind the results. We refer the reader to Chapter 3 in Cover and Thomas and to Information Theory lecture notes (available here) for further details and proofs.

Before we get started, we define the notation used throughout this chapter:

  1. Alphabet: $\mathcal{X}$ denotes the alphabet, i.e., the set of possible values that the random variable of interest can take.
  2. iid source: An independent and identically distributed (iid) source produces a sequence of random variables that are independent of each other and have the same distribution, e.g., a sequence of tosses of a coin. We use the notation $X_1, X_2, \ldots, X_n \sim P$ for iid random variables from distribution $P$.
  3. Source sequence: $x^n = (x_1, x_2, \ldots, x_n)$ denotes the $n$-tuple representing $n$ source symbols, and $\mathcal{X}^n$ represents the set of all possible $n$-tuples $x^n$. Note that we use lowercase $x^n$ for a particular realization of the random sequence $X^n$.
  4. Probability: Under the iid source model, we simply have $P(x^n) = \prod_{i=1}^{n} P(x_i)$, where we slightly abuse the notation by using $P$ to represent both the probability function of the $n$-tuple $x^n$ and that for a given symbol $x_i$.

As an example, we could have alphabet $\mathcal{X} = \{A, B\}$ and a source distributed iid according to $P(A) = 0.8$, $P(B) = 0.2$. Then the source sequence $x^3 = (A, B, A)$ has probability $P(A)\,P(B)\,P(A) = 0.8 \times 0.2 \times 0.8 = 0.128$.
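To make the notation concrete, here is a minimal Python sketch computing $P(x^n)$ under the iid model; the particular alphabet, distribution, and function name are illustrative choices, not fixed by the text.

```python
def sequence_prob(xs, p):
    """Probability of the sequence xs under an iid source with symbol distribution p."""
    prob = 1.0
    for x in xs:
        prob *= p[x]
    return prob

p = {"A": 0.8, "B": 0.2}          # hypothetical source distribution
print(sequence_prob("ABA", p))    # 0.8 * 0.2 * 0.8 = 0.128
```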

Before we start the discussion of the asymptotic equipartition property and typical sequences, let us recall the (weak) law of large numbers (LLN), which says that the empirical average of a sequence of iid random variables converges towards their expectation (to understand the different types of convergence of random variables and the strong LLN, please refer to a probability textbook).

Theorem-1: Weak Law of Large Numbers

For $X_1, X_2, \ldots, X_n$ iid $\sim P$ with a real-valued alphabet $\mathcal{X}$, we have the following for any $\epsilon > 0$:

$$\lim_{n \to \infty} P\left(\left|\frac{1}{n}\sum_{i=1}^{n} X_i - E[X]\right| \leq \epsilon\right) = 1.$$

That is, for arbitrarily small $\epsilon$, as $n$ becomes large, the probability that the empirical average is within $\epsilon$ of the expectation approaches $1$.
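As a quick illustration of the theorem (a simulation sketch under an assumed Bernoulli source, not part of the formal statement), the empirical average visibly approaches the expectation as $n$ grows:

```python
import random

random.seed(0)
p = 0.2   # hypothetical P(X = 1) for a Bernoulli source, so E[X] = 0.2

for n in [10, 100, 1000, 100000]:
    samples = [1 if random.random() < p else 0 for _ in range(n)]
    print(n, sum(samples) / n)    # empirical average approaches E[X] = 0.2
```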

With this background, we are ready to jump in!

The $\epsilon$-typical set

Definition-1

For some $\epsilon > 0$, the source sequence $x^n$ is $\epsilon$-typical if

$$2^{-n(H(X)+\epsilon)} \leq P(x^n) \leq 2^{-n(H(X)-\epsilon)}.$$

Recall that $H(X)$ is the entropy of the random variable $X$. It is easy to see that the condition is equivalent to saying that $H(X) - \epsilon \leq \frac{1}{n}\log_2\frac{1}{P(x^n)} \leq H(X) + \epsilon$. The set of all **$\epsilon$-typical** sequences is called the **$\epsilon$-typical set**, denoted as $A_\epsilon^{(n)}$. Put in words, the **$\epsilon$-typical set** contains all $n$-length sequences whose probability is close to $2^{-nH(X)}$.
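The definition translates directly into code. The sketch below (the binary source and helper names are assumptions for illustration) checks whether $H(X) - \epsilon \leq \frac{1}{n}\log_2\frac{1}{P(x^n)} \leq H(X) + \epsilon$:

```python
import math

def entropy(p):
    """Entropy H(X) in bits of a distribution given as a dict symbol -> probability."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def is_typical(xs, p, eps):
    """True if xs is eps-typical for the iid source with symbol distribution p."""
    n = len(xs)
    neg_log_prob = -sum(math.log2(p[x]) for x in xs)   # log2(1 / P(x^n))
    return abs(neg_log_prob / n - entropy(p)) <= eps

p = {0: 0.2, 1: 0.8}                          # hypothetical binary source
print(is_typical([1, 1, 1, 1, 0], p, 0.1))    # ~20% zeros: typical -> True
print(is_typical([0, 0, 0, 0, 0], p, 0.1))    # all zeros: not typical -> False
```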

Next, we look at some probabilities of the typical set:

Theorem-2: Properties of typical sets

For any $\epsilon > 0$,

  1. $\lim_{n \to \infty} P\left(X^n \in A_\epsilon^{(n)}\right) = 1$.
  2. $\left|A_\epsilon^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$.
  3. For large enough $n$, $\left|A_\epsilon^{(n)}\right| \geq (1-\epsilon)\, 2^{n(H(X)-\epsilon)}$.

Simply put, this is saying that for large $n$, the typical set has probability close to $1$ and size roughly $2^{nH(X)}$.

Proof sketch for Theorem-2

  1. This follows directly from the Weak LLN by noting the definition of typical sequences and the following facts: (i) $\log_2\frac{1}{P(X^n)} = \sum_{i=1}^{n} \log_2\frac{1}{P(X_i)}$ (since the source is iid). (ii) $E\left[\log_2\frac{1}{P(X)}\right] = H(X)$ from the definition of entropy. Thus applying the LLN to $\log_2\frac{1}{P(X_i)}$ instead of $X_i$ directly gives the desired result.

Both 2 & 3 follow from the definition of typical sequences (which roughly says the probability of each typical sequence is close to $2^{-nH(X)}$) and the fact that the typical set has probability close to $1$ (less than $1$ just because total probability is at most $1$, and close to $1$ due to property 1).
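The key step in the proof of property 1, applying the LLN to $\log_2\frac{1}{P(X_i)}$ rather than to $X_i$ itself, can also be seen empirically. The following sketch (assuming a hypothetical binary source) shows $\frac{1}{n}\log_2\frac{1}{P(X^n)}$ concentrating around $H(X)$:

```python
import math
import random

random.seed(0)
p = {0: 0.2, 1: 0.8}                                # hypothetical binary source
H = -sum(q * math.log2(q) for q in p.values())      # entropy, ~0.722 bits

def normalized_log_inv_prob(n):
    """Draw X^n iid from p and return (1/n) * log2(1 / P(X^n))."""
    xs = [1 if random.random() < p[1] else 0 for _ in range(n)]
    return -sum(math.log2(p[x]) for x in xs) / n

for n in [10, 100, 1000, 10000]:
    print(n, round(normalized_log_inv_prob(n), 4), "entropy:", round(H, 4))
# (1/n) log2(1/P(X^n)) concentrates around H(X) as n grows -- exactly the
# LLN applied to the random variables log2(1/P(X_i))
```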

Intuition into the typical set

To gain intuition into the typical set and its properties, consider the set $\mathcal{X}^n$ of all length-$n$ sequences, which has size $|\mathcal{X}|^n = 2^{n\log_2|\mathcal{X}|}$, and assume $H(X) < \log_2|\mathcal{X}|$ (i.e., the source is not uniformly distributed). Then the AEP says that (for $n$ sufficiently large), all the probability mass is concentrated in an exponentially smaller subset of size roughly $2^{nH(X)}$. That is,

$$\frac{\left|A_\epsilon^{(n)}\right|}{|\mathcal{X}^n|} \approx \frac{2^{nH(X)}}{2^{n\log_2|\mathcal{X}|}} = 2^{-n\left(\log_2|\mathcal{X}| - H(X)\right)} \xrightarrow{n \to \infty} 0.$$

Furthermore, all the elements in the typical set are roughly equiprobable, each with probability $\approx 2^{-nH(X)}$. Thus, $\mathcal{X}^n$ contains within itself a subset $A_\epsilon^{(n)}$ that contains almost the entire probability mass, and within that subset the elements are roughly uniformly distributed. It is important to note that these properties hold only for large enough $n$, since ultimately they are derived from the law of large numbers. The intuition into typical sets is illustrated in the figure below.

Typical set
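To see Theorem-2 in action on a toy example, the brute-force sketch below (the source parameters and $\epsilon$ are illustrative assumptions; the effects only become dramatic for much larger $n$ than we can enumerate here) computes the size and probability of $A_\epsilon^{(n)}$ for a small binary source:

```python
import itertools
import math

p = {0: 0.2, 1: 0.8}                                # hypothetical binary source
H = -sum(q * math.log2(q) for q in p.values())
eps = 0.3

for n in [5, 10, 15, 20]:
    size, prob_mass = 0, 0.0
    for xs in itertools.product([0, 1], repeat=n):  # all |X|^n = 2^n sequences
        seq_prob = math.prod(p[x] for x in xs)
        if abs(-math.log2(seq_prob) / n - H) <= eps:
            size += 1
            prob_mass += seq_prob
    print(n, size, 2 ** n, round(prob_mass, 3))
# the typical set's probability climbs towards 1 while its size remains a small
# (and, for large n, exponentially shrinking) fraction of all 2^n sequences;
# at these tiny n both trends are still mild
```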

Quiz-1: Intuition into the typical set

What do the elements of the typical set look like?

  1. Consider a binary source with $P(X=0) = P(X=1) = 0.5$. What is the size of the typical set (Hint: this doesn't depend on $\epsilon$!)? Which elements of $\mathcal{X}^n$ are typical? After looking at this example, can you still claim the typical set is exponentially smaller than the set of all sequences in all cases?
  2. Now consider a binary source with $P(X=1) = p$ for some $p > 0.5$ (say $p = 0.8$). Recalling the fact that typical elements have probability around $2^{-nH(X)}$, what can you say about the elements of the typical set? (Hint: what fraction of zeros and ones would you expect a typical sequence to have? Check if your intuition matches with the expression.)
  3. In part 2, you can easily verify that the most probable sequence is the sequence consisting of all $1$s. Is that sequence typical (for small $\epsilon$)? If not, how do you square that with the fact that the typical set contains almost all of the probability mass?

Before we discuss the ramifications of the above for compressibility of sequences, we introduce one further property of subsets of . This property says that any set substantially smaller than the typical set has negligible probability mass. Intuitively this holds because all elements in the typical set have roughly the same probability, and hence removing a large fraction of them leaves very little probability mass remaining. In other words, very roughly the property says that the typical set is the smallest set containing most of the probability. This property will be very useful below when we link typical sets to compression. Here we just state the theorem and leave the proof to the references.

Theorem-3: Sets exponentially smaller than the typical set

Fix $\epsilon > 0$ and sets $B^{(n)} \subseteq \mathcal{X}^n$ such that $\left|B^{(n)}\right| \leq 2^{n(H(X)-\epsilon)}$. Then

$$\lim_{n \to \infty} P\left(X^n \in B^{(n)}\right) = 0.$$
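Although we leave the proof to the references, the statement is easy to probe numerically. The sketch below (a hypothetical binary source with $P(X=1)=0.8$; sequences are grouped by their number of zeros, since that determines their probability) computes the total probability of the $2^{\lfloor n(H(X)-\epsilon)\rfloor}$ most probable sequences, which is the best any set of that size can do:

```python
import math

p = 0.8                                   # hypothetical P(X = 1); P(X = 0) = 0.2
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
eps = 0.1

def prob_of_top_sequences(n, k):
    """Total probability of the k most probable length-n sequences.

    Probability only depends on the number of zeros, and fewer zeros means
    higher probability (since p > 0.5), so we take whole 'j zeros' classes
    in order until k sequences have been collected."""
    remaining, total = k, 0.0
    for j in range(n + 1):                            # j = number of zeros
        count = math.comb(n, j)
        seq_prob = p ** (n - j) * (1 - p) ** j        # probability of one such sequence
        take = min(count, remaining)
        total += take * seq_prob
        remaining -= take
        if remaining == 0:
            break
    return total

for n in [50, 100, 200, 400]:
    k = 2 ** math.floor(n * (H - eps))                # a set smaller than ~2^{nH}
    print(n, round(prob_of_top_sequences(n, k), 4))
# even the k MOST probable sequences capture a probability mass that shrinks
# toward 0 as n grows, as Theorem-3 predicts
```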

Compression based on typical sets

Suppose you were tasked with developing a compressor for a source with $M$ possible values, each of them equally likely to occur. It is easy to verify that a simple fixed-length code that encodes each of the $M$ values with $\lceil \log_2 M \rceil$ bits is optimal for this source. But, if you think about the AEP property above, as $n$ grows, almost all the probability in the set of $n$-length sequences over alphabet $\mathcal{X}$ is contained in the typical set with roughly $2^{nH(X)}$ elements (where $H(X)$ is the entropy of the source). And the elements within the typical set are (roughly) equally likely to occur. Ignoring the non-typical elements for a moment, we can encode the typical elements with roughly $nH(X)$ bits using the simple logic mentioned earlier. We have encoded $n$ input symbols with roughly $nH(X)$ bits, effectively using $H(X)$ bits/symbol! This was not truly lossless because we fail to represent the non-typical sequences. But this can be considered near-lossless since the non-typical sequences have probability close to $0$, and hence this code is lossless for a given input with very high probability. Note that we ignored the $\epsilon$'s and the ceilings in the treatment above, but that shouldn't take away from the main conclusions, which become truer and truer as $n$ grows.

On the flip side, suppose we wanted to encode the elements in $\mathcal{X}^n$ with $nR$ bits for some $R < H(X)$. Now, $nR$ bits can represent $2^{nR}$ elements. But according to Theorem-3 above, the set of elements correctly represented by such a code has negligible probability as $n$ grows. This means a fixed-length code using less than $H(X)$ bits per symbol is unable to losslessly represent an input sequence with very high probability. Thus, using the AEP we can prove that any fixed-length near-lossless compression scheme must use at least $H(X)$ bits per symbol.

Lossless compression scheme

Let us now develop a lossless compression algorithm based on the AEP, this time being very precise. As before, we focus on encoding sequences of length $n$. Note that a lossless compression algorithm aiming to achieve entropy must be variable-length (unless the source itself is uniformly distributed). And the AEP teaches us that elements in the typical set should ideally be represented using roughly $nH(X)$ bits. With this in mind, consider the following scheme:

Lossless compression using typical sets

Fix $\epsilon > 0$, and assign an index $i(x^n)$ ranging from $1$ to $\left|A_\epsilon^{(n)}\right|$ to the elements in $A_\epsilon^{(n)}$ (the order doesn't matter). In addition, define a fixed-length code for $\mathcal{X}^n$ that uses $\lceil n \log_2 |\mathcal{X}| \rceil$ bits to encode any input sequence. Now the encoding of $x^n$ is simply:

  • if $x^n \in A_\epsilon^{(n)}$, encode $x^n$ as $0$ followed by the binary representation of the index $i(x^n)$
  • else, encode $x^n$ as $1$ followed by the $\lceil n \log_2 |\mathcal{X}| \rceil$-bit fixed-length encoding of $x^n$

Let's calculate the expected code length (in bits per symbol) used by the above scheme. For $n$ large enough, we can safely assume that $P\left(X^n \in A_\epsilon^{(n)}\right) \geq 1 - \epsilon$ by the AEP. Furthermore, we know that the index $i(x^n)$ needs at most $\lceil n(H(X)+\epsilon) \rceil$ bits to represent, since $\left|A_\epsilon^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$ (Theorem-2 part 2). Thus, denoting the code length by $L(X^n)$, we have

$$E\left[L(X^n)\right] \leq P\left(X^n \in A_\epsilon^{(n)}\right)\left(n(H(X)+\epsilon) + 2\right) + P\left(X^n \notin A_\epsilon^{(n)}\right)\left(n\log_2|\mathcal{X}| + 2\right) \leq n(H(X)+\epsilon) + 2 + \epsilon\left(n\log_2|\mathcal{X}| + 2\right),$$

where the first term corresponds to the typical sequences and the second term to everything else (note that we use the simplest possible encoding for non-typical sequences since they don't really matter in terms of their probability). Also note that we add $1$ to each of the lengths to account for the $0$ or $1$ we prefix in the scheme above, and another $1$ to account for rounding up to an integer number of bits. Simplifying the above, we have

$$\frac{1}{n} E\left[L(X^n)\right] \leq H(X) + \epsilon',$$

where $\epsilon'$ represents terms bounded by $c\epsilon$ for some constant $c$ when $\epsilon$ is small and $n$ is large. Thus we can achieve code lengths arbitrarily close to $H(X)$ bits/symbol by selecting $\epsilon$ and $n$ appropriately!
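Here is a toy, brute-force sketch of the encoder for a small binary source (the parameters, the enumeration of $A_\epsilon^{(n)}$, and the variable names are all illustrative assumptions; as Quiz-2 below explores, this approach is far from practical for large $n$):

```python
import itertools
import math

p = {0: 0.2, 1: 0.8}                     # hypothetical binary iid source
H = -sum(q * math.log2(q) for q in p.values())
eps, n = 0.3, 12

# Enumerate the eps-typical set once and fix an (arbitrary) ordering of its elements.
typical = [xs for xs in itertools.product([0, 1], repeat=n)
           if abs(-sum(math.log2(p[x]) for x in xs) / n - H) <= eps]
index_of = {xs: i for i, xs in enumerate(typical)}
index_bits = max(1, math.ceil(math.log2(len(typical))))   # <= ceil(n (H + eps))

def encode(xs):
    """Flag bit '0' + typical-set index, or flag bit '1' + raw fixed-length code."""
    if xs in index_of:
        return "0" + format(index_of[xs], f"0{index_bits}b")
    return "1" + "".join(str(x) for x in xs)   # binary alphabet: n raw bits

print(len(typical), 2 ** n)                                 # typical set vs all 2^n sequences
print(len(encode((1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1))))    # typical: short code
print(len(encode((0,) * n)))                                # non-typical: long code

# exact expected code length in bits/symbol for this toy setting
avg = sum(math.prod(p[x] for x in xs) * len(encode(xs))
          for xs in itertools.product([0, 1], repeat=n)) / n
print(round(avg, 3), "bits/symbol vs entropy", round(H, 3))
# with such a tiny n and large eps the rate is only a little below 1 bit/symbol;
# the gap to H(X) closes as n grows and eps shrinks
```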

Quiz-2: Lossless compression based on typical sets

  1. Describe the decoding algorithm for the above scheme and verify that the scheme is indeed lossless.
  2. What is the complexity of the encoding and decoding procedure as a function of $n$? Consider both the time and memory usage. Is this a practical scheme for large values of $n$?